Within the housing industry, there is a sector devoted to renting out housing to other people. Airbnb, a large company in this area, has asked us to help it predict rental prices given recent fears about a market crash. We will provide Airbnb with a working model for predicting rental prices. For this scenario, our object of analysis is an individual Airbnb rental home in the San Francisco area, the population is all Airbnb rentals in the San Francisco area from about 2012 to the present day, and our response variable is the price of the Airbnb rental. All data in our sample was collected by Dr. Greatrex.
Our initial thinking is that there are two groups of factors potentially driving rental prices. One pertains to external factors separate from the residence, such as crime. The other pertains to the property itself, such as the ability to work from home, the number of reviews, and more.
| Variable Name | Description |
|---|---|
| host_name | The name of the person renting out the room/house |
| neighborhood | Neighborhood name in San Francisco |
| room_type | Entire home or individual room |
| price | The price that the airbnb rented for per night in USD |
| minimum_nights | The minimum number of nights that a guest must stay in order to book |
| number_of_reviews | Total number of reviews for that property |
| review_per_month | Average number of reviews per month |
| Number_Listings_by_Host | One means that this is the only airbnb rented by that host (e.g. likely their own home), several means that they might be a professional short term letter |
| availability_365 | How many days the property is available during each year |
| num_trees.500m | Number of trees within 500m of the property |
| num_bikehire.500m | Number of bike rental stations within 500m of the property (a measure of tourism) |
| num_public_art.500m | Number of public art works and murals within 500m of the property (a measure of tourism) |
| num_burglary.500m | Number of burglaries within 500m of the property during that month/year |
| num_motor_theft.500 | Number of car thefts within 500m of the property during that month/year |
| CensusTract_GEOID | The census tract that the property is situated in |
| CT_population_density.km2 | The population density in the census tract/neighborhood (people per square km) |
| CT_median_age | The median age in the census tract/neighborhood (years) |
| CT_percent.incomeGt75E | The percent of people who make more than $75,000 per year in that tract/neighborhood (a measure of wealth) |
| CT_percent.under18 | The percentage of people under the age 18 in the census tract/neighborhood |
| CT_percent.over65 | The percentage of people over age 65 in the census tract/neighborhood |
| CT_percent.poverty | The percent of people in poverty in the census tract/neighborhood |
| CT_percent.foreignborn | The percent of foreign born people in the census tract/ neighborhood |
| CT_percent.workhome | The percent of people who work from home in that tract/neighborhood |
| CT_percent.withdegree | The percent of people with a degree in that tract/neighborhood |
| CT_percent.collegestudents | The percent of college students in that tract/neighborhood |
| CT_gini_inequalityindex | The Gini inequality index in that tract/neighborhood |
| CT_median.housevalue | The median house value in that tract/neighborhood |
| CT_median.rent | The median rent in that tract/neighborhood |
| CT_percenthouse.rented | The percentage of house rented out in that tract/neighborhood |
| CT_percenthouse.vacant | The percentage of vacant houses in that tract/neighborhood |
| CT_percenthouse.broadband | The percentage of houses with broadband internet in that tract/neighborhood |
| Longitude | degrees |
| Latitude | degrees |
room <- read_excel("AirBnb_ROOM_sample_jxh6215.xlsx")
## New names:
## • `` -> `...1`
house <- read_excel("AirBnb_HOUSE_sample_jxh6215.xlsx")
## New names:
## • `` -> `...1`
room.sf <- st_as_sf(room,coords=c("Longitude","Latitude"),crs=4326)
house.sf <- st_as_sf(house,coords=c("Longitude","Latitude"),crs=4326)
skim(room)
| Name | room |
|---|---|
| Number of rows | 150 |
| Number of columns | 34 |
| Column type frequency: | |
| character | 6 |
| numeric | 28 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| …1 | 0 | 1 | 1 | 3 | 0 | 150 | 0 |
| host_name | 0 | 1 | 1 | 16 | 0 | 126 | 0 |
| neighbourhood | 0 | 1 | 6 | 21 | 0 | 31 | 0 |
| room_type | 0 | 1 | 11 | 11 | 0 | 1 | 0 |
| Number_Listings_by_Host | 0 | 1 | 3 | 7 | 0 | 2 | 0 |
| CensusTract_GEOID | 0 | 1 | 11 | 11 | 0 | 105 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 204.15 | 115.81 | 33.00 | 123.50 | 174.00 | 259.50 | 725.00 | ▇▆▂▁▁ |
| minimum_nights | 0 | 1 | 17.90 | 17.94 | 1.00 | 2.00 | 30.00 | 30.00 | 120.00 | ▇▇▁▁▁ |
| number_of_reviews | 0 | 1 | 57.13 | 101.16 | 1.00 | 4.00 | 10.00 | 49.00 | 450.00 | ▇▁▁▁▁ |
| reviews_per_month | 0 | 1 | 1.24 | 1.67 | 0.02 | 0.16 | 0.42 | 1.68 | 7.67 | ▇▁▁▁▁ |
| availability_365 | 0 | 1 | 155.38 | 117.05 | 0.00 | 44.75 | 145.50 | 276.75 | 365.00 | ▇▅▅▅▃ |
| num_trees.500m | 0 | 1 | 22.58 | 13.81 | 0.00 | 12.25 | 20.00 | 30.00 | 74.00 | ▆▇▅▁▁ |
| num_bikehire.500m | 0 | 1 | 51.00 | 48.32 | 0.00 | 13.25 | 38.00 | 71.50 | 224.00 | ▇▂▂▁▁ |
| num_burglary.500m | 0 | 1 | 7.71 | 6.92 | 0.00 | 2.00 | 6.00 | 11.00 | 29.00 | ▇▆▂▁▁ |
| num_public_art.500m | 0 | 1 | 10.49 | 15.55 | 0.00 | 1.00 | 3.00 | 17.00 | 69.00 | ▇▂▁▁▁ |
| num_motor_theft.500 | 0 | 1 | 8.32 | 7.67 | 0.00 | 3.00 | 7.00 | 11.00 | 43.00 | ▇▃▁▁▁ |
| CT_population_density.km2 | 0 | 1 | 12476.83 | 10001.21 | 2236.18 | 7430.65 | 10017.02 | 13831.31 | 75415.06 | ▇▁▁▁▁ |
| CT_median_age | 0 | 1 | 40.30 | 6.03 | 30.00 | 35.80 | 40.30 | 43.40 | 68.30 | ▆▇▃▁▁ |
| CT_percent.incomeGt75E | 0 | 1 | 0.41 | 0.14 | 0.04 | 0.32 | 0.41 | 0.52 | 0.69 | ▂▂▇▆▃ |
| CT_percent.under18 | 0 | 1 | 0.13 | 0.05 | 0.01 | 0.10 | 0.13 | 0.16 | 0.28 | ▂▆▇▃▁ |
| CT_percent.over65 | 0 | 1 | 0.16 | 0.07 | 0.02 | 0.11 | 0.15 | 0.21 | 0.55 | ▅▇▂▁▁ |
| CT_percent.poverty | 0 | 1 | 0.45 | 0.11 | 0.24 | 0.36 | 0.44 | 0.52 | 0.71 | ▃▇▇▅▁ |
| CT_percent.foreignborn | 0 | 1 | 0.30 | 0.12 | 0.10 | 0.21 | 0.28 | 0.37 | 0.74 | ▇▇▅▂▁ |
| CT_percent.workhome | 0 | 1 | 0.15 | 0.06 | 0.00 | 0.10 | 0.14 | 0.19 | 0.33 | ▂▇▇▆▁ |
| CT_percent.withdegree | 0 | 1 | 0.59 | 0.19 | 0.11 | 0.49 | 0.61 | 0.74 | 0.94 | ▂▂▇▇▅ |
| CT_percent.collegestudents | 0 | 1 | 0.04 | 0.03 | 0.00 | 0.02 | 0.03 | 0.05 | 0.19 | ▇▅▁▁▁ |
| CT_gini_inequalityindex | 0 | 1 | 0.47 | 0.05 | 0.35 | 0.43 | 0.46 | 0.51 | 0.61 | ▁▇▇▅▁ |
| CT_median.housevalue | 0 | 1 | 690.53 | 470.81 | 0.00 | 368.25 | 636.50 | 896.00 | 2273.00 | ▆▇▂▁▁ |
| CT_median.rent | 0 | 1 | 2479.16 | 654.57 | 508.00 | 2053.00 | 2525.50 | 2984.25 | 3501.00 | ▁▁▆▇▆ |
| CT_percenthouse.rented | 0 | 1 | 0.53 | 0.18 | 0.04 | 0.41 | 0.54 | 0.66 | 0.93 | ▂▃▇▇▂ |
| CT_percenthouse.vacant | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.40 | ▇▇▅▁▁ |
| CT_percenthouse.broadband | 0 | 1 | 0.82 | 0.09 | 0.53 | 0.78 | 0.85 | 0.89 | 0.98 | ▁▂▃▇▃ |
| Longitude | 0 | 1 | -122.43 | 0.03 | -122.51 | -122.45 | -122.43 | -122.41 | -122.39 | ▂▃▆▇▆ |
| Latitude | 0 | 1 | 37.76 | 0.02 | 37.71 | 37.75 | 37.76 | 37.78 | 37.81 | ▃▃▇▇▅ |
gghistostats(price,data=room,results.subtitle = FALSE)
You can see that there is no missing data in this file. You can also see that the response variable, price, is right-skewed. Next we will run a correlation matrix to narrow down the predictor variables we will use. We will also make a tmap to identify any noticeable trends between geolocation and price.
room.numeric <- room[ , sapply(room,is.numeric)]
corrplot(cor(room.numeric),
method="ellipse",
type="lower",
tl.cex = 0.5,
number.cex =1,
tl.srt = 45,
mar= c(0,0,0,0),
diag = FALSE)
tmap_mode("view")
## tmap mode set to interactive viewing
qtm(room.sf,dots.col="price",title="AirBnB Price",dots.size=.5,dots.palette="Blues")
Of the 34 variables that we are analyzing, 28 are numeric and 6 are categorical. The correlation matrix suggests that the percentage working from home and the minimum nights a guest must stay have the strongest correlations with price. There is also some overlap among the predictors, as several are correlated with each other; for example, the crime counts correlate with median house value. Some relatedness among variables is expected, so we will prefer variables that are not as correlated with the others. It is also important to note that the location map does not show any clear pattern between location and price.
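A corrplot ellipse matrix can be hard to scan for the overlapping predictors mentioned above. As a sketch, the highly correlated pairs can be listed directly from the correlation matrix; this is shown on a small synthetic data frame (with the real data you would pass `room.numeric` instead):

```r
# Sketch: list predictor pairs whose absolute correlation exceeds a cutoff.
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)   # strongly related to a
toy$c <- rnorm(50)                     # unrelated noise

high_cor_pairs <- function(df, cutoff = 0.7) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA       # keep each pair only once
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = cm[idx])
}

high_cor_pairs(toy)   # flags only the a-b pair
```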
We will now build and compare two single-predictor, simple linear regression models. The first variable we will look at is the percentage of people who work from home. This variable showed a fairly strong correlation with price while being related to only a few other variables. The other variable we will look at is the percentage of people making more than $75,000. This also showed a good correlation with price and will be interesting to model.
percent_home__plot <- ggplot(room, aes(x=CT_percent.workhome, y=price)) +
geom_point() +
geom_smooth(method=lm , color="red", se=FALSE) +
theme_ipsum() +
xlab("Percentage Working from Home") +
ylab("Price") +
ggtitle("Scatterplot Comparing Price to Home Work Percentage")
percent_home__plot
## `geom_smooth()` using formula = 'y ~ x'
LM_PercentHome <- lm(price~CT_percent.workhome,data = room)
LM_PercentHome
##
## Call:
## lm(formula = price ~ CT_percent.workhome, data = room)
##
## Coefficients:
## (Intercept) CT_percent.workhome
## 80.63 832.71
\[ \hat{price} = 80.63 +832.71(PercentWorkingFromHome) \]
ols_plot_resid_fit(LM_PercentHome)
ols_plot_resid_qq(LM_PercentHome)
ols_plot_resid_hist(LM_PercentHome)
ols_test_normality(LM_PercentHome)
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.907 0.0000
## Kolmogorov-Smirnov 0.1019 0.0885
## Cramer-von Mises 12.7925 0.0000
## Anderson-Darling 2.5881 0.0000
## -----------------------------------------------
It is clear that this model fails the LINE assumption checks. The relationship may not be linear, and the errors are neither normally distributed nor of equal variance. We will apply a transformation to try to fix this model.
#adding in transformation columns
room$log_workhome <- log(room$CT_percent.workhome)
room$log_price <- log(room$price)
room$sqrt_workhome <- sqrt(room$CT_percent.workhome)
room$inv_workhome <- 1/(room$CT_percent.workhome)
room$sqrt_price <- sqrt(room$price)
room$inv_price <- 1/(room$price)
newroom <- room
which(room$CT_percent.workhome == min(room$CT_percent.workhome))
## [1] 64
newroom <- newroom[-64,]
LM_log_price_workhome <- lm(log_price~CT_percent.workhome,data=newroom)
ols_plot_resid_stud_fit(LM_log_price_workhome)
summary(LM_log_price_workhome)
##
## Call:
## lm(formula = log_price ~ CT_percent.workhome, data = newroom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35308 -0.34492 0.02056 0.31837 1.37310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5937 0.0962 47.752 < 2e-16 ***
## CT_percent.workhome 3.9579 0.5930 6.674 4.75e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4587 on 147 degrees of freedom
## Multiple R-squared: 0.2325, Adjusted R-squared: 0.2273
## F-statistic: 44.54 on 1 and 147 DF, p-value: 4.751e-10
LM_sqrt_price_workhome <- lm(sqrt_price~CT_percent.workhome,data=newroom)
ols_plot_resid_stud_fit(LM_sqrt_price_workhome)
summary(LM_sqrt_price_workhome)
##
## Call:
## lm(formula = sqrt_price ~ CT_percent.workhome, data = newroom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8138 -2.4850 -0.1822 1.9163 12.9002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6401 0.6854 14.065 < 2e-16 ***
## CT_percent.workhome 28.0230 4.2253 6.632 5.91e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.268 on 147 degrees of freedom
## Multiple R-squared: 0.2303, Adjusted R-squared: 0.2251
## F-statistic: 43.99 on 1 and 147 DF, p-value: 5.905e-10
LM_inv_price_workhome <- lm(inv_price~CT_percent.workhome,data=newroom)
ols_plot_resid_stud_fit(LM_inv_price_workhome)
summary(LM_inv_price_workhome)
##
## Call:
## lm(formula = inv_price ~ CT_percent.workhome, data = newroom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.004907 -0.002121 -0.000561 0.001486 0.021871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0099807 0.0006874 14.520 < 2e-16 ***
## CT_percent.workhome -0.0239590 0.0042374 -5.654 7.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003278 on 147 degrees of freedom
## Multiple R-squared: 0.1786, Adjusted R-squared: 0.173
## F-statistic: 31.97 on 1 and 147 DF, p-value: 7.9e-08
We tried different combinations of transforming the predictor variable, the response variable, and both. Most of the options do not show any promise, as they either fail the LINE checks or explain too little of the variation. The most promising combination for this variable is taking the log of the price with the untransformed predictor. It is important to note that we had to remove a data point with a value of 0 for the percentage working from home, since a 0 would break the log and inverse transformations. We will now analyze our new first model.
percent_home__plot2 <- ggplot(newroom, aes(x=CT_percent.workhome, y=log_price)) +
geom_point() +
geom_smooth(method=lm , color="red", se=FALSE) +
theme_ipsum() +
xlab("Percentage Working from Home") +
ylab("Log of Price") +
ggtitle("Scatterplot Comparing Log of the Price to Home Work Percentage")
percent_home__plot2
## `geom_smooth()` using formula = 'y ~ x'
LM_log_price_workhome
##
## Call:
## lm(formula = log_price ~ CT_percent.workhome, data = newroom)
##
## Coefficients:
## (Intercept) CT_percent.workhome
## 4.594 3.958
\[ log(\hat{price}) = 4.594 +3.958(PercentageWorkFromHome) \]
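Because the response is now log(price), a fitted value has to be exponentiated to get back to dollars. A minimal sketch using the fitted coefficients above and a hypothetical tract where 15% of people work from home:

```r
# Back-transforming a log-price prediction to dollars,
# using the fitted coefficients from the model above.
b0 <- 4.594
b1 <- 3.958
workhome <- 0.15                  # hypothetical tract: 15% work from home
log_price_hat <- b0 + b1 * workhome
price_hat <- exp(log_price_hat)   # back to dollars, roughly $179 per night
```

One standard caveat of log-response models: exponentiating the fitted log-price gives the predicted median price rather than the mean.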
ols_plot_resid_stud_fit(LM_log_price_workhome)
ols_plot_resid_qq(LM_log_price_workhome)
ols_test_normality(LM_log_price_workhome)
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9959 0.9523
## Kolmogorov-Smirnov 0.0404 0.9684
## Cramer-von Mises 18.3092 0.0000
## Anderson-Darling 0.2347 0.7893
## -----------------------------------------------
ols_test_f(LM_log_price_workhome)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of log_price
##
## Test Summary
## -------------------------
## Num DF = 1
## Den DF = 147
## F = 0.5457897
## Prob > F = 0.4612211
This model meets every part of our LINE assumptions. The residual plot suggests the relationship is linear. The residuals also track the QQ line well, with small deviations at both tails. We can assume independence is met, and the equal variance requirement is satisfied: the p-value from the F test for heteroskedasticity was about 0.4612.
ols_plot_resid_lev(LM_log_price_workhome)
Since our model meets the LINE assumptions, we can now check for outliers. From the plot, there are quite a few leverage points and some outliers, but no influential points. We can now move on to our second model.
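For reference, the leverage cutoff drawn in this style of plot is commonly the rule of thumb 2(k+1)/n, where k is the number of predictors (the exact threshold olsrr draws may differ slightly). For this simple model fit to the 149 remaining rows:

```r
# Rule-of-thumb leverage threshold for a simple regression on 149 points.
k <- 1            # one predictor
n <- 149          # rows in newroom after removing the zero
threshold <- 2 * (k + 1) / n
round(threshold, 3)   # points with leverage above about 0.027 get flagged
```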
percent_75k__plot <- ggplot(room, aes(x=CT_percent.incomeGt75E, y=price)) +
geom_point() +
geom_smooth(method=lm , color="red", se=FALSE) +
theme_ipsum() +
xlab("Percentage Making above $75,000") +
ylab("Price") +
ggtitle("Scatterplot Comparing Price to those making above $75,000 Percentage")
percent_75k__plot
## `geom_smooth()` using formula = 'y ~ x'
LM_price_75k <-lm(price~CT_percent.incomeGt75E,data = room)
LM_price_75k
##
## Call:
## lm(formula = price ~ CT_percent.incomeGt75E, data = room)
##
## Coefficients:
## (Intercept) CT_percent.incomeGt75E
## 94.11 267.69
\[ \hat{price} = 94.11 + 267.69(PercentMakingMoreThan75k) \]
ols_plot_resid_stud_fit(LM_price_75k)
ols_plot_resid_qq(LM_price_75k)
ols_plot_resid_hist(LM_price_75k)
ols_test_normality(LM_price_75k)
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.8605 0.0000
## Kolmogorov-Smirnov 0.1466 0.0032
## Cramer-von Mises 15.44 0.0000
## Anderson-Darling 5.1106 0.0000
## -----------------------------------------------
Although it seemed that this could be a good model, it fails the normality portion of our LINE assumptions as well as equal variance. We will try transformations to improve it, using the same ones as before: log, square root, and inverse.
#adding in transformation columns
room$log_75k <- log(room$CT_percent.incomeGt75E)
room$sqrt_75k <- sqrt(room$CT_percent.incomeGt75E)
room$inv_75k <- 1/(room$CT_percent.incomeGt75E)
LM_log_price_75k <- lm(log_price~CT_percent.incomeGt75E,data=room)
ols_plot_resid_stud_fit(LM_log_price_75k)
summary(LM_log_price_75k)
##
## Call:
## lm(formula = log_price ~ CT_percent.incomeGt75E, data = room)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33217 -0.30898 -0.02274 0.27814 1.40910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5789 0.1191 38.433 < 2e-16 ***
## CT_percent.incomeGt75E 1.4650 0.2738 5.352 3.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4794 on 148 degrees of freedom
## Multiple R-squared: 0.1621, Adjusted R-squared: 0.1565
## F-statistic: 28.64 on 1 and 148 DF, p-value: 3.251e-07
LM_sqrt_price_75k <- lm(sqrt_price~CT_percent.incomeGt75E,data=room)
ols_plot_resid_stud_fit(LM_sqrt_price_75k)
summary(LM_sqrt_price_75k)
##
## Call:
## lm(formula = sqrt_price ~ CT_percent.incomeGt75E, data = room)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0913 -2.3038 -0.5087 1.6584 13.1518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.8397 0.8591 11.454 < 2e-16 ***
## CT_percent.incomeGt75E 9.6362 1.9740 4.882 2.69e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.457 on 148 degrees of freedom
## Multiple R-squared: 0.1387, Adjusted R-squared: 0.1329
## F-statistic: 23.83 on 1 and 148 DF, p-value: 2.695e-06
LM_inv_price_75k <- lm(inv_price~CT_percent.incomeGt75E,data=room)
ols_plot_resid_stud_fit(LM_inv_price_75k)
summary(LM_inv_price_75k)
##
## Call:
## lm(formula = inv_price ~ CT_percent.incomeGt75E, data = room)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0058980 -0.0020275 -0.0003596 0.0013209 0.0214247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0106175 0.0008206 12.938 < 2e-16 ***
## CT_percent.incomeGt75E -0.0102020 0.0018856 -5.411 2.47e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003302 on 148 degrees of freedom
## Multiple R-squared: 0.1651, Adjusted R-squared: 0.1595
## F-statistic: 29.27 on 1 and 148 DF, p-value: 2.472e-07
We have done the same thing as with the previous model, trying different combinations of transformations to the response and predictor variables. In this case, the combination that appears linear with the strongest fit logs the price and keeps the percentage of people making more than $75,000 unchanged. We will now analyze it further.
percent_75k__plot2 <- ggplot(room, aes(x=CT_percent.incomeGt75E, y=log_price)) +
geom_point() +
geom_smooth(method=lm , color="red", se=FALSE) +
theme_ipsum() +
xlab("Percentage Making above $75,000") +
ylab("Log of Price") +
ggtitle("Scatterplot Comparing Log of Price to those making above $75,000 Percentage")
percent_75k__plot2
## `geom_smooth()` using formula = 'y ~ x'
LM_log_price_75k
##
## Call:
## lm(formula = log_price ~ CT_percent.incomeGt75E, data = room)
##
## Coefficients:
## (Intercept) CT_percent.incomeGt75E
## 4.579 1.465
\[ log(\hat{price}) = 4.579 +1.465(PercentMakingMoreThan75k) \]
ols_plot_resid_stud_fit(LM_log_price_75k)
ols_plot_resid_qq(LM_log_price_75k)
ols_plot_resid_hist(LM_log_price_75k)
ols_test_normality(LM_log_price_75k)
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9907 0.4290
## Kolmogorov-Smirnov 0.0648 0.5541
## Cramer-von Mises 19.0283 0.0000
## Anderson-Darling 0.5184 0.1852
## -----------------------------------------------
After the transformation, this model fits our LINE assumptions much better. The relationship is linear, the residuals are approximately normal apart from deviations in the tails, and the points appear independent. The p-value from the Shapiro-Wilk test was 0.4290. We will now check for outliers.
ols_plot_resid_lev(LM_log_price_75k)
There are clearly some outliers as well as leverage points, but no influential points. We can now move on to comparing our two simple regression models.
We have been asked to compare our two models on five criteria: the percentage of variability explained, AIC, effect (slope) size, whether the slope is significant, and whether the model meets the LINE assumptions. Here are the models we will be using:
Model 1:\[ log(\hat{price}) = 4.594 +3.958(PercentageWorkFromHome) \]
Model 2: \[ log(\hat{price}) = 4.579 +1.465(PercentMakingMoreThan75k) \]
ols_regress(LM_log_price_workhome)
## Model Summary
## ---------------------------------------------------------------
## R 0.482 RMSE 0.456
## R-Squared 0.233 MSE 0.210
## Adj. R-Squared 0.227 Coef. Var 8.848
## Pred R-Squared 0.212 AIC 194.614
## MAE 0.369 SBC 203.625
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 9.374 1 9.374 44.541 0.0000
## Residual 30.936 147 0.210
## Total 40.310 148
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------
## (Intercept) 4.594 0.096 47.752 0.000 4.404 4.784
## CT_percent.workhome 3.958 0.593 0.482 6.674 0.000 2.786 5.130
## ---------------------------------------------------------------------------------------------
summary(LM_log_price_workhome)
##
## Call:
## lm(formula = log_price ~ CT_percent.workhome, data = newroom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35308 -0.34492 0.02056 0.31837 1.37310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5937 0.0962 47.752 < 2e-16 ***
## CT_percent.workhome 3.9579 0.5930 6.674 4.75e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4587 on 147 degrees of freedom
## Multiple R-squared: 0.2325, Adjusted R-squared: 0.2273
## F-statistic: 44.54 on 1 and 147 DF, p-value: 4.751e-10
ols_regress(LM_log_price_75k)
## Model Summary
## ---------------------------------------------------------------
## R 0.403 RMSE 0.476
## R-Squared 0.162 MSE 0.230
## Adj. R-Squared 0.156 Coef. Var 9.252
## Pred R-Squared 0.141 AIC 209.078
## MAE 0.380 SBC 218.110
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 6.581 1 6.581 28.64 0.0000
## Residual 34.009 148 0.230
## Total 40.590 149
## --------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) 4.579 0.119 38.433 0.000 4.343 4.814
## CT_percent.incomeGt75E 1.465 0.274 0.403 5.352 0.000 0.924 2.006
## ------------------------------------------------------------------------------------------------
summary(LM_log_price_75k)
##
## Call:
## lm(formula = log_price ~ CT_percent.incomeGt75E, data = room)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33217 -0.30898 -0.02274 0.27814 1.40910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5789 0.1191 38.433 < 2e-16 ***
## CT_percent.incomeGt75E 1.4650 0.2738 5.352 3.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4794 on 148 degrees of freedom
## Multiple R-squared: 0.1621, Adjusted R-squared: 0.1565
## F-statistic: 28.64 on 1 and 148 DF, p-value: 3.251e-07
Percentage of variability explained:
Model 1 has an R-squared of 0.233 while model 2 has an R-squared of 0.162, so on R-squared alone model 1 is clearly better. Model 1 explains about 23.3% of the variation in log price in this sample.
AIC:
Model 1 has an AIC of 194.614 and model 2 has an AIC of 209.078. Model 1 wins here again with the smaller AIC value.
Effect (slope) size: Since both models log the response and leave the predictor untransformed, the slopes of the two models are in the same units. Model 1 has a slope of 3.958 log dollars per unit increase in the proportion working from home, while model 2's slope is less than half that at 1.465 log dollars per unit increase in the proportion making more than $75,000. Model 1 has the larger slope.
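Because the response is logged, the slopes can also be read multiplicatively. A sketch of the price multiplier implied by a one-percentage-point (0.01) increase in each predictor, using the fitted slopes above:

```r
# Price multiplier implied by a 0.01 increase in each proportion predictor.
slope1 <- 3.958      # Model 1: proportion working from home
slope2 <- 1.465      # Model 2: proportion earning more than $75k
exp(slope1 * 0.01)   # Model 1: about a 4.0% increase in price
exp(slope2 * 0.01)   # Model 2: about a 1.5% increase in price
```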
Is the slope significant:
After doing a t-test for the slope of model 1 and model 2, we find that both have p-values of essentially 0. This means that both slopes are significant.
Does the model meet LINE:
As shown above, both models meet the LINE assumptions after the log transformation.
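The numeric criteria above can also be pulled programmatically from a fitted `lm` object rather than read off the printouts. A sketch on a synthetic fit (with the real data you would pass `LM_log_price_workhome` and `LM_log_price_75k`):

```r
# Extract the comparison criteria from a fitted lm object.
model_criteria <- function(fit) {
  s <- summary(fit)
  c(r_squared = unname(s$r.squared),
    aic       = AIC(fit),
    slope     = unname(coef(fit)[2]),
    slope_p   = unname(s$coefficients[2, 4]))
}

# Demonstrate on synthetic data shaped roughly like Model 1.
set.seed(42)
x <- runif(100)
y <- 4.6 + 4 * x + rnorm(100, sd = 0.45)
model_criteria(lm(y ~ x))
```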
After comparing the two models, we have decided to use model 1 because it explains more of the variability, has a lower AIC, and has the larger slope.
We will now move on to multiple linear regression models. We will begin with a "full model" containing ten variables and then use best subsets to narrow it down further. The ten variables are a mixture of variables that showed higher correlations in the earlier correlation matrix and factors that we believe could work together. We will also use the log of price in this model, since the two previous models showed promise with that transformation.
names(room)
## [1] "...1" "host_name"
## [3] "neighbourhood" "room_type"
## [5] "price" "minimum_nights"
## [7] "number_of_reviews" "reviews_per_month"
## [9] "Number_Listings_by_Host" "availability_365"
## [11] "num_trees.500m" "num_bikehire.500m"
## [13] "num_burglary.500m" "num_public_art.500m"
## [15] "num_motor_theft.500" "CensusTract_GEOID"
## [17] "CT_population_density.km2" "CT_median_age"
## [19] "CT_percent.incomeGt75E" "CT_percent.under18"
## [21] "CT_percent.over65" "CT_percent.poverty"
## [23] "CT_percent.foreignborn" "CT_percent.workhome"
## [25] "CT_percent.withdegree" "CT_percent.collegestudents"
## [27] "CT_gini_inequalityindex" "CT_median.housevalue"
## [29] "CT_median.rent" "CT_percenthouse.rented"
## [31] "CT_percenthouse.vacant" "CT_percenthouse.broadband"
## [33] "Longitude" "Latitude"
## [35] "log_workhome" "log_price"
## [37] "sqrt_workhome" "inv_workhome"
## [39] "sqrt_price" "inv_price"
## [41] "log_75k" "sqrt_75k"
## [43] "inv_75k"
fullmodel <- lm(log_price~reviews_per_month+number_of_reviews+availability_365+CT_percent.incomeGt75E+CT_percent.workhome+CT_percent.withdegree+CT_population_density.km2+CT_median_age+CT_percenthouse.vacant+minimum_nights,data = room)
BestSubsets <- ols_step_best_subset(fullmodel)
BestSubsets
## Best Subsets Regression
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Model Index Predictors
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 CT_percent.workhome
## 2 CT_percent.workhome minimum_nights
## 3 number_of_reviews CT_percent.workhome minimum_nights
## 4 number_of_reviews CT_percent.workhome CT_population_density.km2 minimum_nights
## 5 number_of_reviews CT_percent.workhome CT_population_density.km2 CT_percenthouse.vacant minimum_nights
## 6 number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_population_density.km2 CT_percenthouse.vacant minimum_nights
## 7 number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_percenthouse.vacant minimum_nights
## 8 number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_median_age CT_percenthouse.vacant minimum_nights
## 9 reviews_per_month number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_median_age CT_percenthouse.vacant minimum_nights
## 10 reviews_per_month number_of_reviews availability_365 CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_median_age CT_percenthouse.vacant minimum_nights
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------
## 1 0.2377 0.2326 0.2178 41.2339 194.8929 -231.6632 203.9248 31.3580 0.2118 0.0014 0.7829
## 2 0.3655 0.3568 0.3394 11.8618 169.3845 -256.5255 181.4270 26.2825 0.1787 0.0012 0.6604
## 3 0.4008 0.3885 0.3573 5.1833 162.7909 -262.7383 177.8441 24.9903 0.1710 0.0011 0.6320
## 4 0.4300 0.4143 0.381 0.0101 157.2963 -267.6819 175.3601 23.9374 0.1649 0.0011 0.6093
## 5 0.4310 0.4112 0.375 1.7747 159.0439 -265.7705 180.1183 24.0643 0.1668 0.0011 0.6165
## 6 0.4323 0.4085 0.3689 3.4351 160.6790 -263.9533 184.7641 24.1749 0.1686 0.0011 0.6232
## 7 0.4336 0.4057 0.3631 5.1304 162.3508 -262.0930 189.4465 24.2931 0.1705 0.0011 0.6302
## 8 0.4340 0.4019 0.3573 7.0296 164.2421 -260.0316 194.3484 24.4489 0.1727 0.0012 0.6383
## 9 0.4341 0.3977 0.3487 9.0081 166.2189 -257.8937 199.3359 24.6210 0.1750 0.0012 0.6468
## 10 0.4341 0.3934 0.3361 11.0000 168.2102 -255.7429 204.3378 24.7980 0.1774 0.0012 0.6555
## ----------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Of the 10 candidate variable combinations, the fourth model — using number of reviews, percentage of people who work from home, population density, and minimum nights — performs best: it has the lowest AIC (157.30), the lowest SBC (175.36), and the lowest C(p), and its adjusted R-squared (0.4143) is the highest in the table.
finalmodel <- lm(log_price~number_of_reviews+CT_percent.workhome+CT_population_density.km2+minimum_nights,data=room)
finalmodel
##
## Call:
## lm(formula = log_price ~ number_of_reviews + CT_percent.workhome +
## CT_population_density.km2 + minimum_nights, data = room)
##
## Coefficients:
## (Intercept) number_of_reviews
## 5.090e+00 -1.138e-03
## CT_percent.workhome CT_population_density.km2
## 3.255e+00 -9.210e-06
## minimum_nights
## -1.180e-02
Final model (the fourth model from the subsets table):
\[ \log(\widehat{price}) = 5.09 - 0.001138(\text{NumberOfReviews}) + 3.255(\text{PercentWorkFromHome}) - 0.00000921(\text{PopulationDensity, km}^2) - 0.0118(\text{MinimumNights}) \]
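As a quick sanity check on the fitted equation, consider a hypothetical listing (the values are illustrative, not drawn from the data) with 50 reviews, 10% of the census tract working from home, a density of 5,000 people per km², and a 2-night minimum:

\[ \log(\widehat{price}) = 5.09 - 0.001138(50) + 3.255(0.10) - 0.00000921(5000) - 0.0118(2) \approx 5.289 \]

\[ \widehat{price} = e^{5.289} \approx \$198 \text{ per night} \]

Note that because the slope on work-from-home percentage is 3.255, the variable must be on a 0–1 proportion scale (so 10% enters as 0.10).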
ols_plot_resid_stud_fit(finalmodel)
ols_plot_resid_qq(finalmodel)
ols_plot_resid_hist(finalmodel)
ols_test_normality(finalmodel)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9927 0.6491
## Kolmogorov-Smirnov 0.0449 0.9224
## Cramer-von Mises 21.6152 0.0000
## Anderson-Darling 0.3216 0.5263
## -----------------------------------------------
ols_test_f(finalmodel)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of log_price
##
## Test Summary
## --------------------------
## Num DF = 1
## Den DF = 148
## F = 0.02279498
## Prob > F = 0.880197
ols_plot_resid_lev(finalmodel)
This multiple linear regression model meets every criterion of LINE. The residual plot shows no evidence of a curve, we can assume independence, the residuals are approximately normal with slight deviations in the tails (three of the four normality tests fail to reject normality; only the Cramér-von Mises test disagrees), and the F test for heteroskedasticity gives a p-value of 0.88, so we cannot reject the hypothesis that the variance is homogeneous. There do appear to be outliers and leverage points, along with two influential points, which may warrant further investigation. However, whenever we removed those influential points and refit the model, the new model still contained influential points.
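One way to flag the influential points mentioned above for follow-up is a Cook's distance plot — a sketch using the same olsrr package as the other diagnostics (the 4/n cutoff is a common rule of thumb, not the only choice):

```r
# Bar plot of Cook's distance, with olsrr's default threshold line
ols_plot_cooksd_bar(finalmodel)

# Or list the flagged rows directly for inspection
cd <- cooks.distance(finalmodel)
room[cd > 4 / length(cd), c("price", "number_of_reviews", "minimum_nights")]
```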
summary(finalmodel)
##
## Call:
## lm(formula = log_price ~ number_of_reviews + CT_percent.workhome +
## CT_population_density.km2 + minimum_nights, data = room)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33415 -0.24587 -0.01494 0.27183 1.14868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.090e+00 1.122e-01 45.343 < 2e-16 ***
## number_of_reviews -1.138e-03 3.431e-04 -3.318 0.00115 **
## CT_percent.workhome 3.255e+00 5.188e-01 6.273 3.84e-09 ***
## CT_population_density.km2 -9.210e-06 3.379e-06 -2.726 0.00721 **
## minimum_nights -1.180e-02 1.945e-03 -6.067 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3994 on 145 degrees of freedom
## Multiple R-squared: 0.43, Adjusted R-squared: 0.4143
## F-statistic: 27.35 on 4 and 145 DF, p-value: < 2.2e-16
Looking at the summary, every predictor has a statistically significant p-value. Percentage working from home and minimum nights are somewhat more significant than number of reviews and population density. Percentage working from home has the largest effect on the model, with a slope of 3.255.
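Because the response is log(price), each slope translates into a multiplicative effect on price. For example, holding the other predictors fixed, each additional required night multiplies the expected price by

\[ e^{-0.0118} \approx 0.988, \]

about a 1.2% decrease, while a one-percentage-point (0.01) increase in the share working from home multiplies it by

\[ e^{3.255 \times 0.01} \approx 1.033, \]

about a 3.3% increase.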